53 research outputs found
The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise
In many real-world classification problems, class-conditional classification noise (CCC-Noise) degrades the performance of a classifier that is naively built by ignoring it. In this paper, we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distribution of the misclassification error rate of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR degrades less under CCC-Noise than NDA does. Under CCC-Noise, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of these two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise than NDA.
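To make the setting concrete, the sketch below shows two ingredients of the problem as described: the Bayes error for two normal populations with a common covariance matrix as a function of the Mahalanobis distance, and class-conditional label flipping. It assumes equal class priors, in which case the Bayes error is Φ(−Δ/2). The function names `bayes_error` and `add_ccc_noise` are hypothetical and do not come from the paper.

```python
import math
import random


def bayes_error(delta):
    """Bayes error for two equal-prior normal populations with a common
    covariance matrix and Mahalanobis distance `delta` (assumption: equal priors)."""
    # Phi(-delta / 2), with the standard normal CDF written via erf.
    return 0.5 * (1.0 + math.erf((-delta / 2.0) / math.sqrt(2.0)))


def add_ccc_noise(labels, flip0, flip1, rng):
    """Flip each 0 label with probability flip0 and each 1 label with
    probability flip1, so the noise rate depends only on the true class."""
    noisy = []
    for y in labels:
        rate = flip1 if y == 1 else flip0
        noisy.append(1 - y if rng.random() < rate else y)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    for delta in (0.5, 1.0, 2.0, 4.0):
        print(f"delta={delta}: Bayes error = {bayes_error(delta):.4f}")
    print(add_ccc_noise([0, 0, 1, 1, 1], flip0=0.1, flip1=0.3, rng=rng))
```

A small Mahalanobis distance means heavily overlapping populations, which is the regime where the abstract reports LR holding up better than NDA.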
Isoform-level gene signature improves prognostic stratification and accurately classifies glioblastoma subtypes.
Molecular stratification of tumors is essential for developing personalized therapies. Although patient stratification strategies have been successful, computational methods to accurately translate a gene signature from a high-throughput platform to a clinically adaptable low-dimensional platform are currently lacking. Here, we describe PIGExClass (platform-independent isoform-level gene-expression based classification-system), a novel computational approach to derive and then transfer gene signatures from one analytical platform to another. We applied PIGExClass to design a reverse transcriptase-quantitative polymerase chain reaction (RT-qPCR) based molecular-subtyping assay for glioblastoma multiforme (GBM), the most aggressive primary brain tumor. Unsupervised clustering of TCGA (The Cancer Genome Atlas Consortium) GBM samples, based on isoform-level gene-expression profiles, recaptured the four known molecular subgroups but switched the subtype for 19% of the samples, resulting in significant (P = 0.0103) survival differences among the refined subgroups. The PIGExClass-derived four-class classifier, which requires only 121 transcript variants, assigns GBM patients' molecular subtype with 92% accuracy. This classifier was translated to an RT-qPCR assay and validated in an independent cohort of 206 GBM samples. Our results demonstrate the efficacy of PIGExClass in the design of clinically adaptable molecular subtyping assays and have implications for developing robust diagnostic assays for cancer patient stratification.
Tree-Based Position Weight Matrix Approach to Model Transcription Factor Binding Site Profiles
Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependency structure and the accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM based Discriminative Approach), to construct the TPWM from ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, performs consistently better than existing tools in detecting TFBSs. The improved accuracy results from modelling the complete dependency structure of the motifs and from a better true positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
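The MCC-based cutoff selection mentioned above can be illustrated with a minimal sketch. It assumes that each candidate site already has a numeric score (for instance from a PWM), that a site is called positive when its score is at least the cutoff, and that candidate cutoffs are the observed scores. The helper names `mcc` and `best_cutoff` are hypothetical, and this is not the TPD implementation.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Returns 0.0 when any marginal is empty (the usual convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_cutoff(pos_scores, neg_scores):
    """Scan every observed score as a cutoff and keep the one with the
    highest MCC; ties go to the smallest cutoff."""
    best = (None, -2.0)  # MCC lies in [-1, 1], so -2 is a safe floor
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= cutoff for s in neg_scores)
        tn = len(neg_scores) - fp
        score = mcc(tp, fp, tn, fn)
        if score > best[1]:
            best = (cutoff, score)
    return best


if __name__ == "__main__":
    # Illustrative scores only, not data from the paper.
    print(best_cutoff([3, 4, 5], [0, 1, 2]))
```

MCC uses all four confusion-matrix cells, so it stays informative when the positive and negative sets differ greatly in size.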
IsoformEx: isoform level gene expression estimation using weighted non-negative least squares from mRNA-Seq data
Background: mRNA-Seq technology has revolutionized the field of transcriptomics for identification and quantification of gene transcripts not only at gene level but also at isoform level. Estimating the expression levels of transcript isoforms from mRNA-Seq data is a challenging problem due to the presence of constitutive exons. Results: We propose a novel algorithm (IsoformEx) that employs a weighted non-negative least squares estimation method to estimate the expression levels of transcript isoforms. Validations based on in silico simulation of mRNA-Seq and qRT-PCR experiments with real mRNA-Seq data showed that IsoformEx could accurately estimate transcript expression levels. In comparisons with published methods, the transcript expression levels estimated by IsoformEx showed higher correlation with known transcript expression levels from simulated mRNA-Seq data, and higher agreement with qRT-PCR measurements of specific transcripts for real mRNA-Seq data. Conclusions: IsoformEx is a fast and accurate algorithm to estimate transcript expression levels and gene expression levels, which takes into account short exons and alternative exons with a weighting scheme. The software is available at http://bioinformatics.wistar.upenn.edu/isoformex.
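As a rough illustration of weighted non-negative least squares, the sketch below minimizes sum_i w_i (A x − b)_i² subject to x ≥ 0 with plain projected gradient descent. Everything here is an assumption made for illustration: the exon-by-isoform indicator matrix `A`, the per-exon coverage vector `b`, the weights `w`, and the solver itself, which is a simple stand-in and not the algorithm or weighting scheme used by IsoformEx.

```python
def weighted_nnls(A, b, w, iters=2000):
    """Projected gradient descent for min_x sum_i w[i] * (A[i] . x - b[i])**2
    subject to x >= 0. A is a list of rows; b and w are lists.
    Assumes A has at least one nonzero entry with positive weight."""
    n_rows, n_cols = len(A), len(A[0])
    # Upper bound on the gradient's Lipschitz constant: 2 * sum_i w_i * ||a_i||^2.
    lipschitz = 2.0 * sum(w[i] * sum(a * a for a in A[i]) for i in range(n_rows))
    step = 1.0 / lipschitz
    x = [0.0] * n_cols
    for _ in range(iters):
        resid = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
                 for i in range(n_rows)]
        grad = [2.0 * sum(w[i] * A[i][j] * resid[i] for i in range(n_rows))
                for j in range(n_cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        x = [max(0.0, x[j] - step * grad[j]) for j in range(n_cols)]
    return x


if __name__ == "__main__":
    # Toy gene: exon 1 is shared by both isoforms, exons 2 and 3 are unique.
    A = [[1, 1], [1, 0], [0, 1]]
    b = [30.0, 10.0, 20.0]   # made-up per-exon coverage
    w = [1.0, 1.0, 1.0]      # made-up weights
    print(weighted_nnls(A, b, w))
```

In the toy example the shared exon alone cannot separate the two isoforms; the isoform-specific exons make the system identifiable, and the non-negativity constraint keeps the estimated expression levels from going below zero.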
Identifying the substrate proteins of U-box E3s E4B and CHIP by orthogonal ubiquitin transfer
E3 ubiquitin (UB) ligases E4B and carboxyl terminus of Hsc70-interacting protein (CHIP) use a common U-box motif to transfer UB from E1 and E2 enzymes to their substrate proteins and regulate diverse cellular processes. To profile their ubiquitination targets in the cell, we used phage display to engineer E2-E4B and E2-CHIP pairs that were free of cross-reactivity with the native UB transfer cascades. We then used the engineered E2-E3 pairs to construct “orthogonal UB transfer (OUT)” cascades so that a mutant UB (xUB) could be exclusively used by the engineered E4B or CHIP to label their substrate proteins. Purification of xUB-conjugated proteins followed by proteomics analysis enabled the identification of hundreds of potential substrates of E4B and CHIP in human embryonic kidney 293 cells. Kinase MAPK3 (mitogen-activated protein kinase 3), methyltransferase PRMT1 (protein arginine N-methyltransferase 1), and phosphatase PPP3CA (protein phosphatase 3 catalytic subunit alpha) were identified as shared substrates of the two E3s. Phosphatase PGAM5 (phosphoglycerate mutase 5) and deubiquitinase OTUB1 (ovarian tumor domain containing ubiquitin aldehyde binding protein 1) were confirmed as E4B substrates, and β-catenin and CDK4 (cyclin-dependent kinase 4) were confirmed as CHIP substrates. On the basis of the CHIP-CDK4 circuit identified by OUT, we revealed that CHIP signals CDK4 degradation in response to endoplasmic reticulum stress.
Distinct mechanisms control genome recognition by p53 at its target genes linked to different cell fates.
The tumor suppressor p53 integrates stress response pathways by selectively engaging one of several potential transcriptomes, thereby triggering cell fate decisions (e.g., cell cycle arrest, apoptosis). Foundational to this process is the binding of tetrameric p53 to 20-bp response elements (REs) in the genome (RRRCWWGYYY-N(0-13)-RRRCWWGYYY, i.e., two 10-bp half-sites separated by a spacer of 0 to 13 bp). In general, REs at cell cycle arrest targets (e.g., p21) are of higher affinity than those at apoptosis targets (e.g., BAX). However, the RE sequence code underlying selectivity remains undeciphered. Here, we identify molecular mechanisms mediating p53 binding to high- and low-affinity REs by showing that key determinants of the code are embedded in the DNA shape. We further demonstrate that differences in minor/major groove widths, encoded by G/C or A/T bp content at positions 3, 8, 13, and 18 in the RE, determine distinct p53 DNA-binding modes by inducing different Arg248 and Lys120 conformations and interactions. The predictive capacity of this code was confirmed in vivo using genome editing at the BAX RE to interconvert the DNA-binding modes, transcription pattern, and cell fate outcome.
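The RE consensus quoted above can be turned into a regular expression with the standard IUPAC nucleotide codes (R = A or G, W = A or T, Y = C or T, N = any base). The sketch below finds exact consensus matches only and says nothing about binding affinity or DNA shape. The names `P53_RE` and `find_response_elements` are hypothetical, and the test sequence is synthetic.

```python
import re

# IUPAC ambiguity codes used in the consensus; other letters stand for themselves.
IUPAC = {"R": "[AG]", "W": "[AT]", "Y": "[CT]"}

# One decameric half-site, RRRCWWGYYY, expanded into a character-class pattern.
HALF_SITE = "".join(IUPAC.get(ch, ch) for ch in "RRRCWWGYYY")

# Two half-sites separated by a spacer of 0 to 13 arbitrary bases.
P53_RE = re.compile(HALF_SITE + "[ACGT]{0,13}" + HALF_SITE)


def find_response_elements(seq):
    """Return (start, matched_sequence) for each non-overlapping exact
    match of the consensus in `seq` (case-insensitive)."""
    return [(m.start(), m.group()) for m in P53_RE.finditer(seq.upper())]


if __name__ == "__main__":
    # Synthetic sequence: two consensus half-sites with no spacer.
    print(find_response_elements("tt" + "AGACATGCCT" * 2 + "tt"))
```

Real response elements often deviate from the consensus at one or more positions, so an exact-match scan like this one would miss them; a scoring matrix or mismatch-tolerant search would be needed for that.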
The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise
In many real-world classification problems, class-conditional classification noise (CCC-Noise) degrades the performance of a classifier that is naively built by ignoring it. In this paper, we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distribution of the misclassification error rate of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR degrades less under CCC-Noise than NDA does. Under CCC-Noise, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of these two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise than NDA.
Keywords: Class noise; Misclassification rate; Misspecified model; Asymptotic distribution
- …